Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Drop the MEET packet if the link node is in handshake state #1436

Merged
merged 2 commits into from
Dec 16, 2024

Conversation

enjoy-binbin
Copy link
Member

@enjoy-binbin enjoy-binbin commented Dec 13, 2024

After #1307 got merged, we notice there is a assert happen in setClusterNodeToInboundClusterLink:

=== ASSERTION FAILED ===
==> '!link->node' is not true

In #778, we will call setClusterNodeToInboundClusterLink to attach the node to the link
during the MEET processing, so if we receive a another MEET packet in a short time, the
node is still in handshake state, we will meet this assert and crash the server.

If the link is bound to a node and the node is in the handshake state, and we receive
a MEET packet, it may be that the sender sent multiple MEET packets so in here we are
dropping the MEET to avoid the assert in setClusterNodeToInboundClusterLink. The assert
will happen if the other sends a MEET packet because it detects that there is no inbound
link, this node creates a new node in HANDSHAKE state (with a random node name), and
respond with a PONG. The other node receives the PONG and removes the CLUSTER_NODE_MEET
flag. This node is supposed to open an outbound connection to the other node in the next
cron cycle, but before this happens, the other node re-sends a MEET on the same link
because it still detects no inbound connection.

Note that in getNodeFromLinkAndMsg, the node in the handshake state has a random name
and not truly "known", so we don't know the sender. Dropping the MEET packet can prevent
us from creating a random node, avoid incorrect link binding, and avoid duplicate MEET
packet eliminate the handshake state.

After valkey-io#1307 got merged, we notice there is a assert happen in setClusterNodeToInboundClusterLink:
```
=== ASSERTION FAILED ===
==> '!link->node' is not true
```

In valkey-io#778, we will call setClusterNodeToInboundClusterLink to attach the node to the link
during the MEET processing, so if we receive a another MEET packet in a short time, the
node is still in handshake state, we will meet this assert and crash the server.

If the link is bound to a node and the node is in the handshake state, and we receive
a MEET packet, it may be that the sender sent multiple MEET packets when reconnecting,
and in here we are dropping the MEET. Note that in getNodeFromLinkAndMsg, the node in
the handshake state has a random name and not truly "known", so we don't know the sender.
Dropping the MEET packet can prevent us from creating a random node, avoid incorrect
link binding, and avoid duplicate MEET packet eliminate the handshake state.

Signed-off-by: Binbin <[email protected]>
@enjoy-binbin
Copy link
Member Author

The ci fail link: https://github.com/valkey-io/valkey/actions/runs/12306207562/job/34347454716#step:3:4364

Runing in a loop can trigger the failiure
./runtest --single unit/cluster/cluster-reliable-meet --dont-clean --dont-pre-clean --loops 10

@pieturin could you take a look and see if i do it right?

Copy link
Contributor

@hpatro hpatro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code flow has become too brittle and we are always in catchup mode. I think we should step back and reconsider the MEET flow.

@madolson / @PingXie / @enjoy-binbin

src/cluster_legacy.c Outdated Show resolved Hide resolved
@pieturin
Copy link
Contributor

Something else we could do, is to avoid sending MEET packets repeatedly. We can introduce a node field like last_meet_packet_sent and only send a MEET packet if the last packet was sent more than the handshake timeout ago here:

valkey/src/cluster_legacy.c

Lines 4978 to 4986 in 5f7fe9e

if (node->link != NULL && node->inbound_link == NULL && nodeInNormalState(node) &&
now - node->inbound_link_freed_time > handshake_timeout) {
/* Node has an outbound link, but no inbound link for more than the handshake timeout.
* This probably means this node does not know us yet, whereas we know it.
* So we send it a MEET packet to do a handshake with it and correct the inconsistent cluster view. */
node->flags |= CLUSTER_NODE_MEET;
serverLog(LL_NOTICE, "Sending MEET packet to node %.40s because there is no inbound link for it", node->name);
clusterSendPing(node->link, CLUSTERMSG_TYPE_MEET);
}

Copy link
Member

@madolson madolson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay merging this and following up as well, given how consistently it causes issues in the CI.

@enjoy-binbin
Copy link
Member Author

i forget to mention if we double this line, we will get the crash in #778.

    clusterSendPing(link, nodeInMeetState(node) ? CLUSTERMSG_TYPE_MEET : CLUSTERMSG_TYPE_PING);
+    clusterSendPing(link, nodeInMeetState(node) ? CLUSTERMSG_TYPE_MEET : CLUSTERMSG_TYPE_PING);

Signed-off-by: Binbin <[email protected]>
@enjoy-binbin
Copy link
Member Author

i am merging it to unblock the CI, please feel free to update the comment or code if needed in #1441.

@enjoy-binbin enjoy-binbin merged commit e024b4b into valkey-io:unstable Dec 16, 2024
45 checks passed
@enjoy-binbin enjoy-binbin deleted the drop_dup_meet branch December 16, 2024 05:43
Copy link

codecov bot commented Dec 16, 2024

Codecov Report

Attention: Patch coverage is 50.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 70.79%. Comparing base (5f7fe9e) to head (9859060).
Report is 12 commits behind head on unstable.

Files with missing lines Patch % Lines
src/cluster_legacy.c 50.00% 3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1436      +/-   ##
============================================
- Coverage     70.90%   70.79%   -0.12%     
============================================
  Files           119      119              
  Lines         64631    64691      +60     
============================================
- Hits          45828    45799      -29     
- Misses        18803    18892      +89     
Files with missing lines Coverage Δ
src/cluster_legacy.c 86.67% <50.00%> (-0.20%) ⬇️

... and 28 files with indirect coverage changes

@PingXie
Copy link
Member

PingXie commented Dec 16, 2024

This code flow has become too brittle and we are always in catchup mode. I think we should step back and reconsider the MEET flow.

100% agreed. The entropy in the MEET flow has become increasingly unmanageable. Was the MEET protocol ever formally spec'd? I suspect not. It might be worthwhile to start by drafting a formal spec of the current behavior as a baseline.

@PingXie
Copy link
Member

PingXie commented Dec 16, 2024

Dropping newer MEET messages feels counterintuitive. While it may help with the assert, it seems to compromise resilience. In this situation, the node associated with the first MEET is likely stale. Could this lead to the MEET ultimately failing, effectively nullifying the retries on the sender side?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants